
Conversation

@pragupta
Collaborator

@pragupta pragupta commented Nov 18, 2025

rocm_base: 3d74218
upstream_main: e2b53ba

jansel and others added 30 commits November 12, 2025 06:18
We probably need something similar for expand

Pull Request resolved: pytorch#167232
Approved by: https://github.com/ColinPeppler
# Motivation
Move `XPUEvent` to `c10/xpu` to keep it consistent with `XPUStream`, which is already in `c10/xpu`. Most importantly, we will leverage `XPUEvent` in our caching allocator instead of a raw SYCL event.

Pull Request resolved: pytorch#158336
Approved by: https://github.com/EikanWang, https://github.com/albanD
These tests have been failing since they were added in pytorch#165381.

Evidence: scrolling back in HUD shows they were failing on that commit.

I'm going to (1) set the accuracy to get CI green and (2) file an issue
for this.

Pull Request resolved: pytorch#167609
Approved by: https://github.com/choijon5, https://github.com/desertfire
There are two motivating use cases for this change:
1) export (when we trace pytree calls into a graph, we don't want to accidentally trace the side-effect bytecode, which would pollute the initial state) -> we want to warn about side effects but not actually apply them
2) vLLM -> they want to detect side effects and error out.

We implement this with two configs: one controls whether side effects are applied (yes by default), and the other controls the reporting level for side effects (warn for export, error for vLLM). We intentionally ignore input side effects because they are captured in the graph, and export never traces the actual dynamo graph module when tracing the pytree calls.

Pull Request resolved: pytorch#167239
Approved by: https://github.com/williamwen42, https://github.com/anijain2305
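A hedged sketch of the two knobs described above, using a standalone dataclass; the field names are hypothetical placeholders, not the PR's actual config fields.

```
# Hypothetical illustration of the two configs described above; the names
# `apply_side_effects` and `side_effect_level` are placeholders, not the
# actual torch._dynamo.config fields added by this PR.
from dataclasses import dataclass

@dataclass
class SideEffectConfig:
    apply_side_effects: bool = True    # replay traced side effects (default behavior)
    side_effect_level: str = "silent"  # "silent" | "warn" (export) | "error" (vLLM)

export_cfg = SideEffectConfig(apply_side_effects=False, side_effect_level="warn")
vllm_cfg = SideEffectConfig(apply_side_effects=True, side_effect_level="error")
```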
This reverts commit 406719c.

Reverted pytorch#166708 on behalf of https://github.com/jeanschmidt due to breaks internal signals, see D86606212 ([comment](pytorch#166708 (comment)))
…y_info (pytorch#162564)"

This reverts commit 3cfbf98.

Reverted pytorch#162564 on behalf of https://github.com/jeanschmidt due to seems to be breaking 1000s of internal build rules, see D86638790 ([comment](pytorch#156812 (comment)))
…h#156812)"

This reverts commit abf31db.

Reverted pytorch#156812 on behalf of https://github.com/jeanschmidt due to seems to be breaking 1000s of internal build rules, see D86638790 ([comment](pytorch#156812 (comment)))
…167335)

Modified cuda_to_hip_mappings.py to map cuSPARSELt headers and types to their hipSPARSELt counterparts, improving compatibility and functionality for ROCm users.

Pull Request resolved: pytorch#167335
Approved by: https://github.com/jeffdaily, https://github.com/Skylion007
This PR applies new Union and Optional typing syntax to some files.

Pull Request resolved: pytorch#167449
Approved by: https://github.com/albanD
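For context, this is the kind of change pytorch#167449 applies (a generic PEP 604 example, not lines from the PR itself):

```
from typing import Optional, Union

# Old style, as accepted before the cleanup:
def pad_old(value: Optional[int], fill: Union[int, float]) -> Optional[int]:
    return value if value is not None else int(fill)

# New union syntax (needs Python 3.10+, or `from __future__ import annotations`):
def pad_new(value: int | None, fill: int | float) -> int | None:
    return value if value is not None else int(fill)
```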
Need to wait for:
Dao-AILab/flash-attention#1998 to land

Pull Request resolved: pytorch#167392
Approved by: https://github.com/jbschlosser
ghstack dependencies: pytorch#167348
This reverts commit c7007e7.

Reverted pytorch#167343 on behalf of https://github.com/jeffdaily due to causing ROCm distributed jobs to time out ([comment](pytorch#167343 (comment)))
…pytorch#164992)

I found that running any compiled function under DebugMode more than once will trigger recompilations, e.g. with the really simple modified test case in `test_compile`:
```
[0/1] [__recompiles] Recompiling function f in /data/users/pianpwk/ptclone/pytorch/test/distributed/tensor/debug/test_debug_mode.py:268
[0/1] [__recompiles]     triggered by the following guard failure(s):
[0/1] [__recompiles]     - 0/0:
[0/2] [__recompiles] Recompiling function f in /data/users/pianpwk/ptclone/pytorch/test/distributed/tensor/debug/test_debug_mode.py:268
[0/2] [__recompiles]     triggered by the following guard failure(s):
[0/2] [__recompiles]     - 0/1:
[0/2] [__recompiles]     - 0/0:
```

Digging deeper, the guard failures were due to TENSOR_MATCH guards failing on dispatch key set checks (seemingly on the Python dispatch key):
https://github.com/pytorch/pytorch/blob/5a1fbf45ad727353e367740ecd8825ca7ee857e9/torch/csrc/dynamo/guards.cpp#L199-L203

This seems to be due to the `ignore_compile_internals=True` flag on custom dispatch modes, which causes these modes to "hide" themselves during compilation, making dynamo guard on the Python dispatch key being off.

The (maybe imperfect) solution is to mask out the Python keys for guard comparisons, when `_is_in_any_mode_without_ignore_compile_internals` is False.

Pull Request resolved: pytorch#164992
Approved by: https://github.com/williamwen42
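A minimal way to surface the recompiles and the guarded dispatch keys described above (a sketch, not the PR's test; `torch._C._dispatch_keys` is a private helper and is assumed to be available):

```
import torch

torch._logging.set_logs(recompiles=True)  # print [__recompiles] guard-failure reasons

@torch.compile
def f(x):
    return x * 2

x = torch.randn(4)
f(x)
f(x)  # without an ambient dispatch mode this should hit the cache, no recompile

# The key set that TENSOR_MATCH guards compare; the Python key only shows up
# here while a (non-hidden) torch dispatch mode is active.
print(torch._C._dispatch_keys(x))
```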
Summary:
Update caffe2/torch/csrc to build under CUDA 13.

As of CUDA 13, CCCL v3 is the default, and as such, nvToolsExt.h has been moved to nvtx3/nvtx3.hpp.

This is needed for building FBGEMM_GPU under CUDA 13 (see D86372925)

Test Plan:
```
# Default build
buck build --flagfile fbcode//mode/dev-nosan fbcode//caffe2:_C_impl
buck build --flagfile fbcode//mode/dev-nosan fbcode//caffe2:_C_impl_cuda

# CUDA 13 build
buck build  @//mode/opt -c fbcode.arch=aarch64 -c fbcode.nvcc_arch=b200 -c fbcode.platform010_cuda_version=13.0  fbcode//caffe2:_C_impl
buck build  @//mode/opt -c fbcode.arch=aarch64 -c fbcode.nvcc_arch=b200 -c fbcode.platform010_cuda_version=13.0  fbcode//caffe2:_C_impl_cuda
```

Differential Revision: D86517946

Pull Request resolved: pytorch#167401
Approved by: https://github.com/Skylion007
… is defined (pytorch#167496)

Fixes pytorch#161660

This extends the `TORCH_STABLE_ONLY` stopgap added in pytorch#161658

Pull Request resolved: pytorch#167496
Approved by: https://github.com/janeyx99
ghstack dependencies: pytorch#167495
Segfaults don't generate XML, so we get no info about them in ClickHouse, the XML, or the JSON; this manually generates something and uploads it to S3 to be ingested.

At some point, some of the existing code for test reports should be changed to just use the JSON that gets uploaded in the job.
Pull Request resolved: pytorch#167250
Approved by: https://github.com/huydhn
Summary: Improve compatibility with projects that have -Wswitch-default errors/warnings enabled by suppressing those errors/warnings in caffe2 headers.

Test Plan: CI Pass

Differential Revision: D86785451

Pull Request resolved: pytorch#167563
Approved by: https://github.com/shoumikhin
Implementation greatly adapted from @lw's pytorch#163505. TORCH_BOX is the StableIValue version of `make_boxed_from_unboxed_functor`.

The differences:
- uses headeronly concepts
- adds an unbox type mapping to support user kernels taking in torch::headeronly::HeaderOnlyArrayRef<T> (by calling to<std::vector<T>> in those cases)

Pull Request resolved: pytorch#167582
Approved by: https://github.com/swolchok
ghstack dependencies: pytorch#167386
…7397)"

This reverts commit 7886070.

Reverted pytorch#167397 on behalf of https://github.com/jeanschmidt due to seems to be breaking executorch signals internally, see D86780724 ([comment](pytorch#167397 (comment)))
…hapes (pytorch#166358)"

This reverts commit 416421c.

Reverted pytorch#166358 on behalf of https://github.com/jeanschmidt due to seems to be breaking internal signals, see D86790405, @angelayi may you help the author get this change landed? ([comment](pytorch#166358 (comment)))
This is a regression introduced by pytorch#167046 that causes CuDNN SDPA to fail with the actionable error `cuDNN Frontend error: [cudnn_frontend] Error: No valid execution plans built.`

Change `cuda_libs` from a dict to a list, and add a `test_sdpa` regression test to the binary smoke tests.

Fixes pytorch#167602
Pull Request resolved: pytorch#167614
Approved by: https://github.com/Aidyn-A, https://github.com/atalman, https://github.com/nWEIdia
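A sketch of the kind of regression check described above, forcing the cuDNN SDPA backend (the exact `test_sdpa` added to the smoke tests may differ):

```
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

if torch.cuda.is_available():
    q, k, v = (torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
               for _ in range(3))
    # Restrict SDPA to the cuDNN backend so a broken cuDNN load surfaces here.
    with sdpa_kernel([SDPBackend.CUDNN_ATTENTION]):
        out = torch.nn.functional.scaled_dot_product_attention(q, k, v)
    assert out.shape == q.shape
```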
…lectorCache.__call__ (pytorch#167487)

Summary:
What: moves `create_no_valid_choices` out of `AlgorithmSelectorCache.__call__` and into the body of `AlgorithmSelectorCache`
Why: nested function definitions make it harder to understand what `AlgorithmSelectorCache.__call__` is doing, on top of making patching/testing/etc more difficult

Test Plan: CI

Differential Revision: D86712921

Pull Request resolved: pytorch#167487
Approved by: https://github.com/aorenste
Summary:
Update caffe2/c10/cuda to build under CUDA 13

As of CUDA 13, cudaMemAdvise() has been updated to take a `cudaMemLocation` argument instead of an `int` device id.

This is needed for building FBGEMM_GPU under CUDA 13 (see D86372925)

Test Plan:
```
# Default build
buck build  @//mode/opt fbcode//caffe2/c10/cuda:cuda

# CUDA 13 build
buck build  @//mode/opt -c fbcode.arch=aarch64 -c fbcode.nvcc_arch=b200 -c fbcode.platform010_cuda_version=13.0  fbcode//caffe2/c10/cuda:cuda

# AMD build
buck build --flagfile fbcode//mode/dev-nosan-amd-gpu fbcode//caffe2/c10/cuda:cuda
```

Reviewed By: atalman

Differential Revision: D86578286

Pull Request resolved: pytorch#167534
Approved by: https://github.com/seemethere
Summary:
Matmul's folding logic can decompose to either BMM or folding + MM.

The current common training path for a 3D * 2D matmul: the library always folds, since Tensor1 and Tensor2 BOTH require grad, so we fold because Tensor2 has grad. But the reasoning isn't really sound; it was done as a memory optimization, even though it also happens to be generally as fast or faster.

However, in chemistry / molecular modeling it is common in inference to calculate forces directly as the derivative of energy (i.e., dl/dX but NOT dl/dW). This exposed a bug where only one of the two tensors requires grad and we may choose NOT to fold, resulting in a 30% regression due to the suboptimal BMM decomposition of torch.nn.Linear (which calls into matmul).

I actually think that even in cases where we need either dl/dX or dl/dW, we should be folding when working with inputs of [B, M, N] and weights of [N, K]. It is strictly better for memory and the same or faster when you consider both forward + backward runtime, and M values that are not multiples of 8 are particularly slow with BMM vs MM.

Also, the compiler out of the box could not solve this issue, which raises another concern (this was actually highlighted 2 years ago in comments, but it still seems to be the case today: pytorch#118548 (comment)).

Differential Revision: D86128493

Pull Request resolved: pytorch#166891
Approved by: https://github.com/ngimel
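A small repro of the inference-style scenario described above (a sketch, not the PR's benchmark): a 3D input that requires grad fed through a Linear whose weight is frozen, so only dl/dX is needed.

```
import torch

B, M, N, K = 8, 33, 256, 512            # M deliberately not a multiple of 8
x = torch.randn(B, M, N, requires_grad=True)
linear = torch.nn.Linear(N, K, bias=False)
linear.weight.requires_grad_(False)      # no dl/dW, mirroring the force-field use case

out = linear(x)                          # 3D x 2D matmul: folding vs. BMM decision
out.sum().backward()                     # only dl/dX is computed
print(x.grad.shape)                      # torch.Size([8, 33, 256])
```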
…ace (pytorch#167248)

Summary:
getCurrentCUDABlasHandle() and getCUDABlasLtWorkspace() use static mutable maps that are not protected from concurrent read-and-write. This leads to crashes.

This diff adds mutexes to synchronize access to the static maps.

Test Plan:
Use a GPU OD, run multi-threaded tests with TSAN:
```
buck test fbcode//mode/dev-tsan fbcode//caffe2:cuda_cublas_handle_pool_test  -- --stress-runs 100
```
https://www.internalfb.com/intern/testinfra/testrun/14355223937501118

TSAN: P2026731804

Differential Revision: D86316117

Pull Request resolved: pytorch#167248
Approved by: https://github.com/Skylion007, https://github.com/malfet
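The actual fix is in C++, but the pattern is the same as this Python analogue: a lock serializing lazy insertion into a shared map.

```
# Python analogue of the fix (the real change guards C++ static maps with std::mutex).
import threading

_handles: dict[int, object] = {}   # e.g. device id -> cuBLAS handle
_handles_lock = threading.Lock()

def get_handle(device: int) -> object:
    with _handles_lock:            # serialize the read-modify-write on the map
        if device not in _handles:
            _handles[device] = object()  # stand-in for handle creation
        return _handles[device]
```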
…ytorch#167663)

Summary:
as title.

Test Plan:
CI

Fixes #ISSUE_NUMBER

Pull Request resolved: pytorch#167663
Approved by: https://github.com/tugsbayasgalan
Should be merged after pytorch#166561
Pull Request resolved: pytorch#166708
Approved by: https://github.com/malfet
oulgen and others added 21 commits November 18, 2025 05:18
Helps reduce Dynamo tracing time. Earlier, the generator object would cause more polyfills.

Pull Request resolved: pytorch#168024
Approved by: https://github.com/williamwen42
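A hedged illustration of the general idea (not the PR's actual change): materializing a list instead of handing Dynamo a generator object avoids the generator polyfills mentioned above.

```
import torch

@torch.compile
def scale_and_sum(xs):
    # A generator expression here (sum(x * 2 for x in xs)) forces Dynamo to
    # polyfill generator semantics; a list comprehension is cheaper to trace.
    return sum([x * 2 for x in xs])

print(scale_and_sum([torch.ones(2), torch.ones(2)]))
```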
Fix for this issue in the DSV3 autobucketing pass: pytorch/torchtitan#2037. Now users should be able to run DSV3 autobucketing E2E.

It fixes three things:

(1) Fix a bug in NCCL estimation support for all-to-all.

(2) For dynamic token dispatch/combine in MoE, add a fall-back value hint to all-to-all's collective size estimation.

(3) Previously, for the schedulable-node check, I directly modified `is_wait` in bucketing.py. It might be safer to add these criteria in overlap_scheduling.py as another function, `_schedulable_wait_node`.

Pull Request resolved: pytorch#167797
Approved by: https://github.com/eellison
This is tested by pytorch#167962, which ensures we get compilation errors when using functions that convert Device/HeaderOnlyArrayRef to StableIValue while targeting 2.9.

Pull Request resolved: pytorch#167802
Approved by: https://github.com/janeyx99
ghstack dependencies: pytorch#168025
Tests are split into libtorch_agnostic_2_9_extension and libtorch_agnostic_2_10_extension depending on the minimum version they should compile+run in

Pull Request resolved: pytorch#167803
Approved by: https://github.com/janeyx99
ghstack dependencies: pytorch#168025, pytorch#167802
…rsion (pytorch#167804)

Adds a CI workflow that tests the wheel built on current main targeting 2.9 with a 2.9 runtime

Pull Request resolved: pytorch#167804
Approved by: https://github.com/janeyx99
ghstack dependencies: pytorch#168025, pytorch#167802, pytorch#167803
…#167962)

Splits each torch library registration in the 2.10 folder into its own file -- I had a script that parsed kernel.cpp to do this, but I felt like forcing this responsibility on the user might be less error prone.

Compiles each file targeting 2.9 and asserts that compilation fails. (There are two 2.9 kernels we use as negative tests where compilation is expected to succeed.)

Pull Request resolved: pytorch#167962
Approved by: https://github.com/janeyx99
ghstack dependencies: pytorch#168025, pytorch#167802, pytorch#167803, pytorch#167804
This PR adds an sm_121a flag for row-wise scaled matmuls on DGX Spark.

Pull Request resolved: pytorch#167734
Approved by: https://github.com/eqy, https://github.com/cyyever
This PR outputs chars to streams without building temporary strings.
The files were modified by (using fish):
```
sed  -i -e 's/<< "\([^\\\']\)"/<< \'\1\'/g' (grep '<< "."' -r torch c10 aten -l)
```
followed by reverting some invalid changes.

Pull Request resolved: pytorch#167899
Approved by: https://github.com/Skylion007
Upgrade all the ROCm docker images to ROCm 7.1 release version.

Pull Request resolved: pytorch#166743
Approved by: https://github.com/atalman

Co-authored-by: Jeff Daily <[email protected]>
Co-authored-by: Prachi Gupta <[email protected]>
Removed distributed-related paths from the labeler configuration.

Pull Request resolved: pytorch#168084
Approved by: https://github.com/wconstab
# Conflicts:
#	.ci/docker/ci_commit_pins/triton.txt
#	requirements.txt
@rocm-repo-management-api

rocm-repo-management-api bot commented Nov 18, 2025

Jenkins build for da5ac4a82178862a6da89a7b573bdba2c4f6c3c0 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

Detected error during base docker image building:

#61 15.40 + sudo -E -H -u jenkins env -u SUDO_UID -u SUDO_GID -u SUDO_COMMAND -u SUDO_USER env PATH=/opt/rocm/llvm/bin:/opt/rocm/opencl/bin:/opt/rocm/hip/bin:/opt/rocm/hcc/bin:/opt/rocm/bin:/opt/conda/envs/py_3.12/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin LD_LIBRARY_PATH= git clone --recursive https://github.com/ROCm/triton triton
#61 15.41 Cloning into 'triton'...
#61 26.15 + cd triton
#61 26.15 + as_jenkins git checkout '<<<<<<<' HEAD ac80c4190aa0321f761a08af97e1e1eee41f01d9 ======= bfeb066872bc1e8b2d2bc0a3b295b99dd77206e7 '>>>>>>>' upstream/main
#61 26.15 + sudo -E -H -u jenkins env -u SUDO_UID -u SUDO_GID -u SUDO_COMMAND -u SUDO_USER env PATH=/opt/rocm/llvm/bin:/opt/rocm/opencl/bin:/opt/rocm/hip/bin:/opt/rocm/hcc/bin:/opt/rocm/bin:/opt/conda/envs/py_3.12/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin LD_LIBRARY_PATH= git checkout '<<<<<<<' HEAD ac80c4190aa0321f761a08af97e1e1eee41f01d9 ======= bfeb066872bc1e8b2d2bc0a3b295b99dd77206e7 '>>>>>>>' upstream/main
#61 26.16 error: pathspec '<<<<<<<' did not match any file(s) known to git
#61 26.16 error: pathspec 'HEAD' did not match any file(s) known to git
#61 26.16 error: pathspec 'ac80c4190aa0321f761a08af97e1e1eee41f01d9' did not match any file(s) known to git
#61 26.16 error: pathspec '=======' did not match any file(s) known to git
#61 26.16 error: pathspec 'bfeb066872bc1e8b2d2bc0a3b295b99dd77206e7' did not match any file(s) known to git
#61 26.16 error: pathspec '>>>>>>>' did not match any file(s) known to git

To keep the triton version consistent with what is in rocm/triton's release/internal/3.5.x branch, we need to keep triton_version.txt at 3.5.0 and move the triton hash to the ToT of that branch.
@rocm-repo-management-api

rocm-repo-management-api bot commented Nov 19, 2025

Jenkins build for a3c49a95de48914e369aa08899a683c2db88ed5f commit finished as SUCCESS
Links: Blue Ocean view / Build artifacts

@pragupta pragupta merged commit 5ca076d into develop Nov 19, 2025
28 checks passed
@pragupta pragupta deleted the develop_IFU_20251118 branch November 19, 2025 13:03